We trained the XGBoost classifier with only objective='binary:logistic' for the XGBoost baseline. The training and validation setup mirrors the logistic regression baseline: we split the training dataset 80:20 with the same random state, fit the model on the training split, and then predict the target feature on the validation split, also obtaining the predicted probabilities. We then create the required graphs. Comparing the XGBoost baseline with the logistic regression baseline trained on the features "Distance" and "Angle": logistic regression made no predictions for the Goal class, whereas the XGBoost baseline does predict some goals. However, even with these predictions, the recall for Goal is still 0.00. The ROC-AUC score is 0.68 for the logistic regression model and 0.71 for the XGBoost baseline.
Looking at the baseline figures, there is not much difference in the ROC-AUC curve, the goal rate curve, or the cumulative goal curve. The reliability curve, however, shows a large difference: the predictions now range from 0.0 to 1.0, and while the model starts close to the perfectly calibrated line, it deviates with a lot of variance as the predicted probability increases.
For XGBoost hyperparameter tuning we used the library Optuna, an open-source hyperparameter optimization framework that automates hyperparameter search.
With this library, we set up an objective function that specifies which parameters to tune, which classifier to use, and which evaluation metric to optimize. We used the details below:
Hyperparameters and their ranges for tuning:
For cross-validation we used sklearn's cross-validation with two evaluation metrics:
Based on the figures above, when we tune the hyperparameters with "f1" as the evaluation metric, the learning_rate parameter receives far more importance than the others. But when we evaluate with the "roc_auc" score, the importance is distributed more evenly across the parameters.
However, comparing the classification reports of the two tuned models, we saw no substantial difference except in recall. The model tuned with "roc_auc" as the evaluation metric had a recall of 0.02 for Goal, while the model tuned with "f1" had a recall of 0.05. We therefore chose the model tuned with "f1", which gave the following optimized parameters:
Optimized parameters: {'n_estimators': 166, 'max_depth': 9, 'alpha': 13, 'colsample_bytree': 0.9943925059480794, 'learning_rate': 0.3}
In the figures above, the ROC-AUC curve and the cumulative percentage of goals curve show little difference. The goal rate curve, however, shows a higher goal rate at the higher percentiles compared to the XGBoost baseline, and the calibration curve shows that the model is now better calibrated than the baseline. As for the metrics: the XGBoost baseline had a recall of 0.00 for Goal and a ROC-AUC score of 0.71; after hyperparameter tuning, the recall for Goal increased to 0.05 and the ROC-AUC score to 0.76.
For feature selection, we tried sklearn's SelectFromModel. We sorted the features by importance and trained the model on subsets selected by an increasing importance threshold: we started with all features and, at each iteration, raised the threshold so that fewer features were selected, then retrained the model on that subset. The figure below shows the model's accuracy as a function of the number of features.
As the figure shows, accuracy changed very little regardless of the number of features. Also, in the importance diagram created with the SHAP library, we see that except for "Time from the last event", which has very little impact, all other features contribute to the model's accuracy.
We therefore concluded that the full feature set works well for the model and that no features need to be removed.